Machine learning-based prediction of relapse in rheumatoid arthritis patients using data on ultrasound examination and blood test

Recent effective therapies enable most rheumatoid arthritis (RA) patients to achieve remission; however, some patients experience relapse. We aimed to predict relapse in RA patients through machine learning (ML) using ultrasound (US) examination and blood test data. Overall, 210 patients with RA in remission at baseline were dichotomized into remission (n = 150) and relapse (n = 60) groups based on disease activity at 2-year follow-up. Three ML classifiers [Logistic Regression, Random Forest, and extreme gradient boosting (XGBoost)] and 73 baseline features (14 US examination features, 54 blood test features, and five patient information features) were used to predict relapse. The best performance was obtained using the XGBoost classifier (area under the receiver operating characteristic curve (AUC) = 0.747), compared with Random Forest and Logistic Regression (AUC = 0.719 and 0.701, respectively). In the XGBoost classifier prediction, ten important features, including wrist/metatarsophalangeal superb microvascular imaging scores, were selected using the recursive feature elimination method. Performance with these features was superior to that obtained with researcher-selected features, which are conventional prognostic markers. These results suggest that ML can provide an accurate prediction of relapse in RA patients, and the use of predictive algorithms may facilitate personalized treatment options.

Prediction of relapse in RA patients using researcher-selected or RFE-selected features. Next, we applied the recursive feature elimination (RFE) algorithm to remove weak features and improve prediction performance. For comparison, we also selected ten features (gender, disease duration, age, wrist SMI score, MTP SMI score, ESR (1 h), CRP, RF, anti-CCP, and MMP-3) that are typically associated with disease activity and prognosis in RA patients. The best Logistic Regression and Random Forest models used 20 RFE-selected features, and the best XGBoost model used ten (Supplementary Table 2). The RFE-selected features are shown in Supplementary Table 3. AUCs, accuracies, precisions, recalls, and F1-scores were higher with RFE-selected features than with researcher-selected features (Table 3). Among the three ML models, XGBoost showed the best performance (AUC = 0.747, Fig. 2), and its AUC was also higher than that obtained using all data (Tables 2, 3). In the XGBoost prediction, ten features were selected: four US examination features, five blood test features, and one patient information feature (Table 4). These results suggest that RFE-selected features are more suitable for ML prediction than researcher-selected features. [...] (Fig. 3A). Furthermore, the values of the ten features were compared between patients with remission and relapse (Fig. 3B). Wrist and MTP SMI scores were significantly higher in patients with relapse, whereas alanine aminotransferase (ALT) and height were significantly lower. There were no significant differences in the remaining six features. The comparison result of all features between patients with [...] 24 was applied to the standardized input data (Fig. 3C). The RFE-selected features were widely distributed in the embedding space, which implies that relapse is predicted by combining SMI scores with various other features.
These results suggest that all types of features, especially US data, are important for predicting relapse in RA patients. In addition to the features that differed significantly between patients with remission and relapse, those without significant differences may also contribute to the prediction.

Discussion
We studied relapse prediction in RA patients through ML using US examination and blood test data. The combination of US examination and blood test data yielded higher AUCs than either data type alone. This result is not surprising because providing more features generally improves prediction. Next, we used RFE to remove weak features and improve prediction performance. Prediction using RFE-selected features outperformed prediction using researcher-selected features, although the number of selected features was the same. This result suggests that RFE uncovered a better combination of features for prediction. Among the ten features selected by RFE in XGBoost, wrist and MTP SMI scores were the two most important, suggesting that US data substantially improved the prediction of relapse in RA patients. Three features (wrist and MTP SMI scores and ESR) were also included in the researcher-selected features. Wrist and MTP SMI scores have been reported as prognostic factors 12,13, and ESR is one of the most fundamental inflammation markers in RA 1,4. Of the remaining seven features, ALT and height were significantly lower in patients with relapse. ALT is not well characterized as a prognostic factor, but its elevation is a marker of liver toxicity in RA treatment 25. Patients with lower ALT may receive lower-intensity therapies, which could contribute to a higher relapse risk. Height is also uncommon as a prognostic factor in RA; however, one study reported that adult height is inversely associated with disease activity 26, which is compatible with our result. There were no significant differences in six features between patients with remission and relapse. This comparison is a univariate analysis of the total cohort. Therefore, information on associations among features and on prognostic significance in patient subgroups is lacking.
Further studies on the importance of these features, including underlying biological mechanisms, are [...] Table 4). This raises the possibility that other related features of greater importance were selected by RFE instead. Among the three ML models, XGBoost, a scalable, distributed gradient-boosted decision tree ML library, achieved the best performance (AUC = 0.747). The model has gained much attention recently due to its superior performance 27,28, which is compatible with the prediction results in this study. Because decision tree-based models are well suited to data sets containing various types of features, Random Forest and XGBoost were more accurate than Logistic Regression on these mixed data. The XGBoost algorithm selects one feature when variables are highly correlated, whereas Random Forest randomly selects a feature and learns the correlations of different features across the model. Therefore, XGBoost was considered more accurate in feature selection because it could select a smaller number of more effective features. In our previous study analyzing almost the same cohort without ML, the highest AUC for predicting relapse was 0.67 12, suggesting that ML using US examination and blood test data improved prediction. The sample size of this study (n = 210) is typical of previous studies applying ML to autoimmune diseases 17. However, a larger sample size could improve prediction. In this study, the follow-up period was 2 years, and the results may vary with follow-up duration. Therefore, the results should be validated in larger populations with multiple follow-up times. Recent studies have shown the possible application of ML to the measurement of US/X-ray images 29,30. Combining such technologies with our ML model may be a promising approach for more convenient and accurate prediction of relapse.
In conclusion, we established an improved model for predicting relapse in RA patients through ML. The combination of US examination and blood test data was a unique approach of this study, and US data were shown to be essential for prediction. The findings may lead to a better assessment of relapse risk and enable the selection of personalized treatment strategies for RA patients.

Methods
Data collection. US examinations were performed using an Aplio500 (Canon Medical Systems) fitted with a 12 MHz linear probe (18L7). Bilateral joints (second through fifth metacarpophalangeal (MCP), radial wrist, ulnar wrist, second through fifth MTP, Lisfranc, cuneonavicular, Chopart, and ankle) were examined as described previously 31. The scanning technique and interpretation of lesions were based on Outcome Measures in Rheumatology (OMERACT) 32. Of the two SMI modes (color-coded and monochrome), the color-coded mode was used in this study. Regions of interest for SMI were fixed at the same size and depth for each joint type. Under the established four-point (0-3) semi-quantitative scoring system 33, gray scale (GS) and SMI scores were determined on-site by at least two of five sonographers with 1-9 years of experience, and agreement was obtained in weekly meetings attended by all five sonographers. The scores for each group of joints were summed as follows: MCP, bilateral second through fifth MCP joints; wrist, bilateral radial and ulnar wrist joints; MTP, bilateral second through fifth MTP joints; Lisfranc, bilateral Lisfranc joints; cuneonavicular, bilateral cuneonavicular joints; Chopart, bilateral Chopart joints; ankle, bilateral ankle joints. There were no missing data in the US examination.
Patient information and blood test data in 2015 were also collected from the KURAMA cohort. Supplementary Table 1 shows the list of features. Cases in which more than 80% of the features were missing were excluded. Missing values were imputed with the median value of each feature. For patients' height and weight, the median values used for imputation were calculated by gender. In total, 14 US examination features, 54 blood test features, and five patient information features were available for analysis. For convenience, patient information was included in the blood test data in this study.
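The missing-value handling described above can be sketched as follows. This is a minimal illustration assuming pandas; the toy table, its column names, and its values are hypothetical and are not taken from the KURAMA cohort, and the sketch is not the authors' code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; column names and values are illustrative only.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F"],
    "height": [155.0, np.nan, 170.0, np.nan, 159.0],
    "weight": [50.0, 54.0, np.nan, 70.0, np.nan],
    "crp":    [0.1, np.nan, 0.3, 0.2, 0.5],
    "esr":    [10.0, 12.0, np.nan, 8.0, 20.0],
})
feature_cols = ["height", "weight", "crp", "esr"]

# 1) Exclude cases in which more than 80% of the features are missing.
missing_frac = df[feature_cols].isna().mean(axis=1)
df = df.loc[missing_frac <= 0.8].copy()

# 2) Height and weight: impute with the median within each gender.
for col in ["height", "weight"]:
    df[col] = df[col].fillna(df.groupby("gender")[col].transform("median"))

# 3) All other features: impute with the overall median of that feature.
other_cols = [c for c in feature_cols if c not in ("height", "weight")]
df[other_cols] = df[other_cols].fillna(df[other_cols].median())
```

In this toy table no case exceeds the 80% threshold, so all five rows are kept and only the imputation steps change the data.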
Prediction models. Three ML classifiers (Logistic Regression, Random Forest, and XGBoost) were employed to predict relapse in RA patients. Logistic Regression is a generalized linear model and a traditional approach to binary classification in clinical prediction. Random Forest is an ensemble algorithm that combines multiple decision trees to build a robust model 34. It is widely used because of the high interpretability of its prediction results. XGBoost is also a decision tree-based ensemble algorithm and achieves more accurate prediction by utilizing gradient boosting 35.
Predictive performance was assessed using the mean AUC by nested stratified six-fold cross-validation (CV). The inner loop, consisting of a three-fold CV, was used to select hyper-parameters by grid-search. The class balance option was set for all models to deal with imbalanced data.
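The nested cross-validation scheme can be sketched with scikit-learn as below. This is an illustrative example under stated assumptions, not the study's code: the data are synthetic (generated to mimic 210 cases, 73 features, and roughly 60 positives), the hyperparameter grid is hypothetical, the standardization step is an assumption, and only Logistic Regression is shown to keep the example dependency-light. The `class_weight="balanced"` option is one way to set class balancing in scikit-learn; the exact option used for each model in the study is not stated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cohort: 210 cases, 73 features, imbalanced classes.
X, y = make_classification(n_samples=210, n_features=73, n_informative=10,
                           weights=[150 / 210], random_state=0)

# Outer loop: stratified six-fold CV for performance estimation.
outer_cv = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)
# Inner loop: three-fold CV for hyperparameter selection by grid search.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

model = make_pipeline(
    StandardScaler(),  # assumed preprocessing step
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0]}  # hypothetical grid

# GridSearchCV is refit inside every outer training fold, so the outer
# test fold never influences the choice of hyperparameters.
search = GridSearchCV(model, param_grid, scoring="roc_auc", cv=inner_cv)
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
mean_auc = outer_auc.mean()
print(f"mean AUC over {len(outer_auc)} outer folds: {mean_auc:.3f}")
```

The reported figure in such a scheme is the mean of the six outer-fold AUCs; the inner folds are used only to pick hyperparameters.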
For feature selection, we employed RFE, a method that extracts subsets of features contributing to prediction performance by recursively removing the weakest features. Because RFE allows the size of the final feature subset to be specified, we varied this size over 5, 10, 20, 30, and 50 and selected the number of features that gave the best AUC. Analyses and model construction were performed with Python 3.8 packages (Scikit-learn 0.23 and XGBoost 1.1.1).
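The search over RFE subset sizes can be sketched as follows. This is a simplified illustration under stated assumptions: the data are synthetic, Logistic Regression stands in for all three classifiers, the elimination step size (`step=5`) and the plain six-fold CV are choices made here for brevity, and how RFE was combined with the nested CV in the study is not specified in the text.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cohort: 210 cases, 73 features, imbalanced classes.
X, y = make_classification(n_samples=210, n_features=73, n_informative=10,
                           weights=[150 / 210], random_state=0)
cv = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)


def make_model(n_features):
    """Pipeline: scale, keep n_features via RFE, then classify."""
    base = LogisticRegression(class_weight="balanced", max_iter=1000)
    return Pipeline([
        ("scale", StandardScaler()),
        # RFE repeatedly fits `base` and drops the lowest-ranked features.
        ("rfe", RFE(base, n_features_to_select=n_features, step=5)),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])


# Mean cross-validated AUC for each candidate subset size.
results = {}
for k in [5, 10, 20, 30, 50]:
    results[k] = cross_val_score(make_model(k), X, y,
                                 scoring="roc_auc", cv=cv).mean()

best_k = max(results, key=results.get)

# Refit on all data to read off which features were retained.
final = make_model(best_k).fit(X, y)
selected_mask = final.named_steps["rfe"].support_
print(best_k, int(selected_mask.sum()))
```

Placing RFE inside the pipeline means the feature ranking is recomputed on each training fold, which keeps the held-out fold from influencing which features are kept.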
Ethical statements. This study was conducted following the principles set down in the Declaration of Helsinki and was approved by the ethics committee of Kyoto University (R0357). All patients provided written informed consent.